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Abstract 

The  increasingly  high  relative  cost  of  moving  data  on  modern  parallel  machines  has  caused  a 
paradigm  shift  in  the  design  of  high-performance  algorithms:  to  achieve  efficiency,  one  must  focus  on 
strategies  which  minimize  data  movement,  rather  than  minimize  arithmetic  operations.  We  call  this 
a  communication- avoiding  approach  to  algorithm  design. 

In  this  work,  we  derive  a  new  parallel  communication-avoiding  matrix  powers  algorithm  for  matrices 
of  the  form  A  =  D  +  U SVH ,  where  D  is  sparse  and  U SVH  has  low  rank  but  may  be  dense.  Matrices  of 
this  form  arise  in  many  practical  applications,  including  power-law  graph  analysis,  circuit  simulation, 
and  algorithms  involving  hierarchical  (' H )  matrices,  such  as  multigrid  methods,  fast  multipole  methods, 
numerical  partial  differential  equation  solvers,  and  preconditioned  iterative  methods. 

If  A  has  this  form,  our  algorithm  enables  a  communication-avoiding  approach.  We  demonstrate 
that,  with  respect  to  the  cost  of  computing  k  sparse  matrix-vector  multiplications,  our  algorithm 
asymptotically  reduces  the  parallel  latency  by  a  factor  of  0(k )  for  small  additional  bandwidth  and 
computation  costs.  Using  problems  from  real-world  applications,  our  performance  model  predicts  that 
this  reduction  in  communication  allows  for  up  to  24  x  speedups  on  petascale  machines. 


1  Introduction 

The  runtime  of  an  algorithm  can  be  modeled  as  a  function  of  computation  cost,  proportional  to  the 
number  of  arithmetic  operations,  and  communication  cost,  proportional  to  the  amount  of  data  movement. 
Traditionally,  to  increase  the  efficiency  of  an  algorithm,  one  sought  to  minimize  the  arithmetic  complexity. 
On  modern  computer  systems,  however,  the  time  to  move  one  word  of  data  is  much  greater  than  the  time  to 
complete  one  floating  point  operation.  This  can  cause  the  performance  of  algorithms  with  low  arithmetic 
intensity  (see  [11])  to  be  communication  bound.  Technology  trends  indicate  that  the  performance  gap 
between  communication  and  computation  will  only  widen  in  future  systems.  This  has  resulted  in  a 
paradigm  shift  in  the  design  of  high-performance  algorithms:  to  achieve  efficiency,  one  must  focus  on 
strategies  which  minimize  data  movement,  rather  than  minimize  arithmetic  operations.  We  call  this  a 
communication- avoiding  approach  to  algorithm  design. 

We  consider  simplified  machine  models,  where  a  sequential  machine  moves  words  of  data  between  a 
slow  memory  of  unbounded  capacity  and  a  fast  memory  of  size  M  words,  and  a  parallel  machine  consists 
of  p  processors,  who  move  data  between  their  local  memories  each  of  size  M  words.  In  both  cases,  data 
has  a  fixed  width  and  is  moved  in  messages  of  up  to  M  contiguous  words.  In  the  parallel  case,  we 
assume  the  processors  communicate  in  a  point-to-point  fashion  over  a  completely  connected  network  (so 
no  contention)  and  each  processor  can  send  or  receive  at  most  one  message  at  a  time.  We  characterize 
an  algorithm  by  its  latency  cost,  the  number  of  messages  sent,  its  bandwidth  cost ,  the  number  of  words 
moved,  and  its  arithmetic  (flop)  cost,  the  number  of  arithmetic  operations  performed.  We  characterize  a 
sequential  or  parallel  machine  by  its  latency  a,  reciprocal  bandwidth  (3,  and  arithmetic  (flop)  rate  7,  in 
addition  to  M  and  p,  and  will  estimate  the  runtime  T  of  an  algorithm  (along  the  critical  path)  as 

T  =  (^messages  •  a)  +  (#words  moved  •  /3)  +  (#flops  -7).  (1) 

For  simplicity,  we  will  not  discuss  overlapping  communication  and  computation,  although  this  potential 
factor  of  2  savings  may  be  important  in  practice. 
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Computing  k  repeated  sparse  matrix-vector1  multiplications  (SpMVs),  or,  a  matrix  powers  computa¬ 
tion,  with  A  £  Cnxn  and  x  £  Cnxq ,  where  typically  q  <C  n,  can  be  written 

Kk+i(A,{Pj}j=o,x)  ■=  [a;(0),...,:r(fc)]  :=  \p0(A)x,p1(A)x, . . .  ,pk(A)x],  (2) 

where  pj  is  a  degree- j  polynomial.  For  i  =  1  to  q,  the  span  of  the  *th  columns  of  x^°\x^\ ...  is  often 
called  a  Krylov  (sub) space.  Matrix  powers  computations  constitute  a  core  kernel  in  a  variety  of  appli¬ 
cations,  including  steepest  descent  algorithms,  the  power  method  to  compute  PageRank,  and  Krylov 
subspace  methods  for  linear  systems  and  eigenvalue  problems.  Due  to  its  low  arithmetic  intensity,  SpMV 
performance  is  communication  bound  on  modern  architectures. 

In  [4],  the  authors  derive  parallel  communication-avoiding  matrix  powers  algorithms  to  compute  (2) 
that  achieve  an  0(k)  reduction  in  parallel  latency  cost  versus  computing  k  repeated  SpMVs,  for  a  set 
number  of  iterations  [4,  9].  These  algorithms  assume  that  A  is  well  partitioned ,  i.e. ,  we  can  partition 
A  such  that  for  each  processor,  computing  powers  up  to  Ak  involves  communication  only  between  0(1) 
nearest  neighbors. 

Although  such  advances  show  promising  speedups  for  a  wide  variety  of  problems,  the  requirement  that 
A  is  well  partitioned  in  the  communication-avoiding  matrix  powers  formulation  of  [4]  excludes  matrices 
whose  partitions  have  poor  surface-to-volume  ratio.  In  this  work,  we  consider  matrices  of  the  form 
A  =  D  +  USVH,  where  D  is  well  partitioned  and  USVH  may  not  be  well  partitioned  but  has  low 
rank.  (Recall  xH  =  xT  denotes  the  Hermitian  transpose  of  x,  and  that  if  an  n-by-n  matrix  has  rank  r, 
it  can  be  stored  and  applied  to  a  vector  with  0(nr )  words  and  flops,  instead  of  the  usual  0(n2).)  There 
are  many  practical  situations  where  such  structures  arise,  including  analysis  of  power-law  (or  scale-free) 
graphs  representing  the  Web  and  social  networks,  and  circuit  simulation.  Hierarchical  ((H)  matrices  (see, 
e.g.,  [1]),  which  appear  in  numerical  partial  differential  equation  solvers,  multigrid  methods,  fast  multipole 
methods,  and  preconditioned  iterative  methods,  also  have  this  form. 

In  this  paper,  we  derive  a  new  parallel  communication-avoiding  matrix  powers  algorithm  for  matrices 
of  the  form  A  =  D  +  USVH ,  where  we  only  require  that  D  (not  A)  be  well-partitioned.  If  this  splitting 
is  possible,  our  algorithm  enables  a  communication-avoiding  approach;  we  demonstrate  that,  with  respect 
to  the  cost  of  computing  k  SpMVs  (the  standard  algorithm),  our  algorithm  asymptotically  reduces  the 
parallel  latency  by  a  factor  of  O(k)  for  small  additional  bandwidth  and  computational  costs.  Using  detailed 
complexity  analysis,  our  performance  model  predicts  that  this  reduction  in  communication  allows  for  up 
to  24 x  speedups  over  the  standard  algorithm  on  petascale  machines. 

2  Background  and  Related  Work 

We  discuss  work  in  related  areas,  namely  parallel  communication-avoiding  matrix  powers  algorithms  and 
the  serial  blocking  covers  technique.  We  touch  on  applications  that  can  benefit  from  our  approach. 

2.1  Efficient  Matrix  Powers  Algorithms 

The  communication-avoiding  matrix  powers  algorithms  derived  in  [5]  fuse  together  a  sequence  of  k  SpMV 
operations  into  one  kernel  invocation.  Depending  on  the  nonzero  structure  of  A  (more  precisely,  of  the 
polynomials  {Pj(A)}k=1),  this  enables  communication  avoidance  in  both  serial  and  parallel  implementa¬ 
tions. 

The  serial  matrix  powers  kernel  reorganizes  the  k  SpMVs  to  maximize  reuse  of  A  and  the  k+  1 
vectors,  ideally  reading  A  and  x  once,  and  writing  the  k  output  vectors  only  once.  When  reading  A  is 
the  dominant  communication  cost  versus  reading/writing  the  vectors  (a  common  situation),  this  implies 
an  fc-fold  decrease  in  both  latency  and  bandwidth  cost. 

In  parallel,  the  matrix  powers  kernel  reorganizes  the  computation  in  a  similar  way  but  with  a  slightly 
different  goal,  since  in  a  parallel  SpMV  operation  only  vector  entries  must  be  communicated.  The  parallel 
matrix  powers  optimization  avoids  interprocessor  synchronization  by  storing  some  redundant  elements 
of  A  and  x  on  different  processors  and  performing  redundant  computation  to  compute  the  k  matrix 
powers  without  further  synchronization.  Provided  the  additional  communication  cost  to  distribute  a:  is  a 
lower-order  term  (equivalently,  Ak  is  well  partitioned),  this  gives  fc-fold  savings  in  latency. 

1  Wlien  q  >  1,  an  ‘SpMV’  is  really  a  sparse  matrix-dense  matrix  multiplication  (SpMM) 
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We  note  that  this  discussion  applies  even  if  A  is  not  given  explicitly. 

Serial  and  parallel  variants  of  the  matrix  powers  kernel,  for  both  structured  and  general  sparse  matrices, 
are  described  in  [8] ,  which  summarizes  most  of  [4]  and  elaborates  on  the  implementation  in  [9] .  Within  [8] , 
we  refer  the  reader  to  the  complexity  analysis  in  Tables  2. 3-2. 4,  the  performance  modeling  in  §2.6,  and  the 
performance  results  in  §2.10.3  and  §2.11.3,  which  demonstrate  that  this  optimization  leads  to  speedups 
in  practice.  For  example,  for  a  2D  9-point  stencil  on  a  rd^-by-n1/2  mesh  with  p  processors,  assuming 
k  <C  (n/p)1/2,  the  number  of  arithmetic  operations  grows  by  a  factor  1  +  2k(p/n)1^2 ,  the  number  of 
messages  decreases  by  a  factor  of  k,  and  the  number  of  words  moved  grows  by  a  factor  of  1  +  kljp/n)1^2  [8] . 
Therefore  since  the  additional  arithmetic  operations  and  additional  words  moved  are  lower  order  terms, 
we  expect  the  parallel  matrix  powers  algorithm  of  [8]  to  give  a  O(fc)  speedup  on  latency-bound  problems. 

2.2  The  Blocking  Covers  Technique 

Many  earlier  research  efforts  sought  to  reorganize  out-of-core  algorithms  to  minimize  data  movement 
within  the  memory  hierarchy  on  a  single  processor.  In  [6],  Hong  and  Kung  prove  a  lower  bound  on 
I/O  complexity  for  a  matrix  powers  computation  on  a  regular  mesh:  if  the  computational  graph  has  a 
t -neighborhood  cover ,  then  the  number  of  I/Os  can  be  reduced  by  a  factor  of  r  over  the  standard  method. 
This  is  the  same  restriction  required  in  the  matrix  powers  algorithms  in  [4],  namely,  that  A  be  well 
partitioned. 

A  shortcoming  of  Hong  and  Kung’s  technique  is  that  certain  graphs  with  low  diameter  (such  as 
multigrid  graphs)  cannot  be  covered  efficiently  with  small,  high-diameter  subgraphs.  In  [7],  this  restriction 
is  overcome  by  relaxing  the  constraint  that  dependencies  in  the  computational  graph  be  respected,  using 
a  blocking  covers  technique.  A  blocking  cover  of  the  computational  graph  has  the  property  that  the 
subgraphs  forming  the  cover  have  large  diameters  (equivalent  to  our  definition  of  well  partitioned)  once 
a  small  number  of  vertices  (the  blockers)  have  been  removed.  The  authors  show  that,  by  introducing 
a  variable  for  each  blocker  for  each  of  the  r  iterations  and  maintaining  linear  dependences  among  the 
removed  vertices,  each  subgraph  can  compute  r  matrix  powers  on  its  data  without  communicating  with 
other  subgraphs.  We  extend  the  blocking  covers  technique  of  [7]  to  the  parallel  case,  and  further  generalize 
this  technique  to  handle  a  larger  class  of  data-sparse  matrix  representations. 

2.3  Motivating  Applications 

Many  scientific  applications  require  solving  linear  systems  Ax  =  b.  Iterative  methods  are  commonly  used 
when  the  coefficient  matrix  is  large  and  sparse.  The  most  general  and  flexible  class  of  iterative  methods  are 
preconditioned  Krylov  subspace  methods.  In  each  iteration  ?n,  the  approximate  solution  is  chosen  from 
the  expanding  Krylov  subspace,  lCm+\ (M~1A,  x )  =  span(a;,  ( M~1A)x , . . . ,  (M~1A)mx),  or  some  variation 
(depending  on  the  preconditioner  At-1  used).  This  includes  the  power  method,  e.g.,  in  the  PageRank 
algorithm. 

Significant  reductions  in  communication  can  be  achieved  by  rearranging  the  iterations  to  compute  a 
k  element  basis  in  one  step  (a  matrix  powers  computation),  essentially  unrolling  the  inner  loop  k  times. 
Such  methods  are  called  fc-step,  or  communication-avoiding,  Krylov  subspace  methods.  There  is  a  wealth 
of  literature  related  to  fc-step  methods.  Space  constraints  prohibit  an  in-depth  discussion;  we  direct  the 
reader  to  the  thorough  overview  given  in  [5,  §1.5-1. 6]. 

As  mentioned  above,  there  are  many  applications  where  iterative  methods  are  used  for  matrices  with 
the  property  of  being  mostly  sparse,  but  with  a  few  low-rank  components  that  may  be  dense.  This 
includes  matrices  from  Web  graphs,  social  networks,  and  circuit  simulations.  Hierarchical  semiseparable 
(HSS)  matrices  [3],  commonly  used  in  PDE  solvers,  also  have  this  property.  Our  technique  is  general  and 
can  be  implemented  to  avoid  communication  in  all  these  applications. 
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3  Derivation 


Recall  that,  given  matrices  A  £  Cnxrl  and  x  £  Cnxq,  and  k  £N,  our  task  is  to  compute  the  matrix  in  (2). 
In  this  work,  we  will  assume  that  the  polynomials  pj  in  (2)  satisfy  a  three-term  recurrence2 

p0(z)  :=  1,  pi 0)  :=  (z  -  a0)p0(2)/7o,  and 

Pj(z)  :=  (0  -  aj-i)Pj-i(z)  -  Pj-zPj-iiz)) hj~\  for  j  >  1. 


It  is  convenient  to  represent  the  polynomials  {pj }* _0  by  their  coefficients,  stored  in  a  tridiagonal  matrix 


Tk  := 


tridiag  ^ 


*  0o 
7o  7i 


fik- 2 

Ofc-1 

7fc-i 


xfc 


(4) 


3.1  Parallel  Matrix  Powers  Algorithms 

For  reference,  we  review  two  of  the  parallel  matrix  powers  algorithms  from  [4,  5,  8],  referred  to  by  the 
authors  as  PAO  and  PA1.  PAO  refers  to  the  naive  algorithm  for  computing  (2)  via  k  SpMV  operations, 
and  PA1  is  the  communication-avoiding  variant.  For  simplicity,  we  use  versions  of  PAO  and  PA1  which  do 
not  exploit  overlapping  communication  and  computation.  While  our  approach  can  be  extended  to  exploit 
overlapping  communication  and  computation,  the  potential  savings  are  limited  by  a  factor  of  2  and  thus 
will  not  affect  the  asymptotic  complexity  of  the  algorithms  given  here. 

Let  the  notation  nz(A)  =  {(i,j)  :  Atj  treated  as  nonzero}  represent  the  edges  in  the  directed  graph  of 
A1  and  let  the  notation  Ax  indicate  the  submatrix  of  A  consisting  of  rows  i  £  I.  To  simplify  the  discussion 
(without  affecting  correctness),  we  will  ignore  cancellation,  meaning  we  assume  nz (pj(A))  C  nz(pj+1(A)) 
and  every  entry  of  x ^  is  treated  as  nonzero  for  all  j  >  0. 

We  construct  a  directed  graph  Q  =  (V,£)  representing  the  dependencies  in  computing  tS-^  :=  pj(A)x 
for  every  0  <  j  <  k.  First,  denoting  row  i  of  by  x^\  we  define  the  n{k  +  1)  vertices  V  :=  {x\^  :  1  < 
i  <  n,0  <  j  <  k}.  The  edge  set  £  consists  of  k  copies  of  nz(A),  between  each  adjacent  pair  of  the  k  +  1 
levels  :=  {x^  :  1  <  i  <  n},  unioned  with  the  edges  due  to  the  polynomial  recurrence,  i.e., 


£  := 


„(j) 


(n 


0  <j<k, 
,i2)6nz(A) 


1  <d,'<d,  I 
0 <j<k-d',  > 
l<i<n  J 


where  d  is  the  number  of  nonzero  superdiagonals  of  Tk  (counting  the  main  diagonal).  We  ignore  possible 
sparsity  within  these  diagonals;  this  simplifies  the  discussion  without  affecting  correctness. 

Now  we  partition  V  ‘rowwise,’  that  is,  each  is  assigned  a  processor  affinity  m  £  {0, . . .  ,p  —  1},  for 
0  <  j  <  k.  Let  Vm  and  Vm  restrict  V  and  to  their  elements  with  affinity  m.  Let  TZ(S)  denote  the 
reachability  set  of  S  C  V,  i.e.,  the  set  S  and  vertices  reachable  from  S  via  (directed)  paths  in  Q;  then  as 
with  V,  we  define  the  subsets  lZm,  and  7 Zm  of  1Z. 

At  the  end  of  the  computation,  processor  m  has  computed/stored  (a  superset  of)  the  entries  Vm.  Thus, 
for  PAO,  processor  m  must  own  j4ri.xO)eV  y  and  Vm\  and  for  PA1,  processor  m  must  own  A^.  ^i )&nyV  ,j 
and  IZ^lVm)- 

With  this  notation,  we  present  the  parallel  matrix  powers  algorithms  PAO  (Alg.  1)  and  PA1  (Alg.  2), 
as  pseudocode  for  processor  m.  The  advantage  of  PA1  over  PAO  is  that  it  may  send  fewer  messages 
between  processors:  whereas  PAO  requires  k  rounds  of  messages,  PA1  requires  only  one.  If  the  number  of 
other  processors  t  m  that  each  processor  m  must  communicate  with  is  the  same  for  both  algorithms, 
PA1  obtains  a  fc-folcl  latency  savings.  In  general,  however,  PA1  incurs  greater  bandwidth,  arithmetic,  and 
storage  costs,  as  processors  may  perform  redundant  computations  to  avoid  communication  (see  Table  1). 
Whether  this  tradeoff  is  worthwhile  depends  on  many  machine  and  algorithm  parameters;  we  refer  to  the 
detailed  analysis  and  modeling  in  [4,  5,  8]. 


2We  see  no  obstruction  to  generalizing  to  longer  recurrences;  we  focus  on  three-term  recurrences,  commonly  used  in 
applications. 
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PAO 

Flops 

E  <?(2|nz(A,;)|  —  1) 

x[j)eVm 

1  <j<k 

Words 

E  E  9(l^m“1W))l  +  i^“1)c^))0 

j  =  1 

Msgs. 

k 

E  (|{*  *  ™  :  |^"1)(V/0))|  >  0} |  +  \{£  ^  m  :  | (V$)\  >  0}|) 

3  = 1 

Mem. 

nZ(^:x«)GVm})  +  9lV-l 

Flops 

E  g(2|nz(Aj)|  —  1) 

PA1 

l<j<k 

Words 

t^m 

Msgs. 

|{^  m  :  |ft£>(V,w)l  >  0}|  +  |{^  to  :  |<J(Vifc))|  >  0}  | 

Mem. 

nz(^{i:x«)eK(vm)})  +  «l7e(V™)l 

Table  1:  Complexity  of  PAO  and  PA1  for  processor  m,  in  terms  of  vertices  of  the  computational  graph  Q. 


Algorithm  1  PAO.  Code  for  processor  m.  Algorithm  2  PA1.  Code  for  processor  to. 


1 

for  j  =  1, . . . ,  k  do 

1 

for  all  processors  £  m  do 

2 

for  all  processors  f  ^  m  do 

2 

Send  x-0^  G  TZm\V^)  to  processor  £. 

3 

Send  G  7£p_1pVpp  to  processor  £. 

3 

Receive  x?-0^  G  7Z ^  (Vm ^ )  from  processor  £. 

4 

Receive  xp_1')  G  TZ^~X\Vm)  from  processor  £. 

4 

end  for 

5 

end  for 

5 

for  j  =  1, . . . ,  k  do 

6 

Compute  x'p  G  Vm  via  recurrence  (3). 

6 

Compute  xp)  G  7 Z^(Vm)  via  recurrence  (3). 

7 

end  for 

7 

end  for 
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3.2  Parallel  Blocking  Covers  Algorithm 

Consider  computing  (2)  with  a  dense  matrix  A.  As  before,  PAO  must  communicate  at  every  step,  but 
now  the  cost  of  PA1  may  be  much  worse:  when  k  >  1,  every  processor  needs  all  n  rows  of  A  and  x^°\  and 
so  there  is  no  parallelism  in  computing  all  but  the  last  SpMV  operation.  (Note  that  when  k  =  1,  PA1 
degenerates  to  PAO.)  If  A  can  be  split  in  the  form  D  +  USVH ,  where  D  is  well-partitioned  and  USVH 
has  low  rank,  we  can  use  a  generalization  of  the  blocking  covers  approach  [7]  to  recover  parallelism;  we 
call  this  algorithm  PA1-BC. 

Split  the  matrix  A  =:  D  +  USVH ,  and  write 

=  (( D  —  ak)x h'-1)  —  pj_2  x^~2^  +  USVH  x^~1",)/')j-\.  (5) 

Now  we  manipulate  the  three-term  recurrence  (3)  to  obtain  the  following  identity,  which  we  will  then  apply 
to  PA1  to  obtain  the  PA1-BC  (the  blocking  covers  algorithm).  Let  plj(z)  represent  a  degree- j  polynomial 
related  to  Pj{z)  by  reindexing  the  coefficients  (ay  7j)  :=  (a^+j,  Pi+j,  li+j)  in  (3). 

Lemma  3.1.  Given  the  additive  splitting  z  =  Z\  +  z2,  (3)  can  he  rewritten  as 

Po{z)  =Po{z1),  Pi(z)=p1{z1)  +  z2p0(z)/'y0r  and  for  j>  1, 

pj{z)  =pj{z1)  +  J2Jt=1Pi-i+1(z1)z2Pj-i(z)/'Yj-i. 

Proof.  The  result  is  readily  established  for  j  =  0  and  j  =  1:  supposing  it  holds  up  through  j  =  t , 

Pt+i{z )  =  ((z  -  at)pt(z)  -  Pt-iPt-i(z))/lt 

1  ,  ^  > 

=  Pt+i{zi)  +  z2pt{z)ht  H - 21  at)Pi-li+1(zi)z2Pt-i{z)/'Yt-i 

i=l 

_  ^1  J2plZ[(z1)Z2Pt-i-l{z)/"ft-i-1 

'*  i=l 


li+j-lPliz)  >  -  1 

li+j-iPJi  (z)  +  /3i+i_2p-_2(2)  i  >  1 


Observe  the  following  identities  (for  any  j  >  0  and  z):  jP0{z)  =  1,  and 

r~. . .  .j 

(z  —  Q!i+j-i)pj_1(z)  = 

Substituting  these  identities,  we  obtain 

t 

Pt+\{z)  =pt+1(zi)  +pt0+1(z1)z2Pt(z)/'Yt+'^2pl~t+1(z1)z2Pt-i{z)/'yt-i 

i= 1 

t+1 

=  Pt+i(zi )  +  ^Pill+2(zi)z2pt-i+i(2)/7t-i+i 


i— 1 


This  completes  the  induction.  □ 

Now  substitute  z  :=  A  =  D  +  USVH  =:  z\  +  z2  in  (6),  premultiply  by  SVH ,  and  postmultiply  by  x, 
to  obtain 

SVHpj{A)x  =  s(vHPj(D)x  +  Yj{VHpj-\+\D)U){SVHpJ_i{A)x/lj^ .  (7) 

i— 1 

To  simplify  (7),  we  define 

Wi  :=  VHPi{D)U  for  0  <  i  <  k  -  2,  (8) 

Hi  :=  VHpi  (D)x  for  0  <  i  <  k  —  1,  (9) 

hj  :=  SVHxu)  for  0  <  j  <  k  -  1,  (10) 


and  rewrite  the  polynomials  in  terms  of  the  original  polynomials  pi  =  p ?,  via  the  following  result. 
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(11) 


Lemma  3.2.  There  exist  coefficient  vectors  wf  €  CI+1  satisfying 

[Wo, Wffiwj  0  ffir)  :=  VHpl{D)U 

for  0  <  i  <  k  —  2, 1  <  j  <  k  —  i  —  1,  and  ifrey  can  fre  computed  by  the  recurrence 
wJ0  :=  1,  :=  (Ti  -  aj+j_i/2,i)t0o/7j-i)  and  for  *  >  1, 

wi  -=  ((A  —  W+j  —  lAi+l,i)^i_i  —  A+j  —  2^i+l,t—  lwi—o)/Ti+j  —  1) 
where  IrjC  denotes  the  leading  r-by-c  submatrix  of  an  identity  matrix. 

Proof.  For  any  j,  we  verify  the  claim  for  ?’  =  0  and  i  =  1;  supposing  it  holds  up  to  some  i  =  t.  Write 

Vh{(D-  at+j)pl(D)  -  /3t+j-iP,t-1(D))U/'yt+j, 

then  substitute 

PUD)U=  [Po{D)U, . . .  ,pi(D)U](wi  0  Ir,r) 
for  i  =  t  —  1  and  i  =  t.  The  conclusion  follows  from  rewriting  (3)  as 

VhD[Po(D)U,  . . .  ,Pi-i(D)U]  =  VH\p0(D)U, . . .  ,Pl(D)U}(Tx  0  Jr,r) 

(for  any  i),  and  substituting.  □ 


Substituting  (8),  (9),  (10),  and  (11)  into  (7),  we  obtain 

bj  =  s(yj  +  E-=1[W"0’ •  •  •  > "'-F •  K-ii+1  ®  Ir.rV’j  ,r:j  )  ■  (13) 

Ultimately  we  must  evaluate  (5),  which  we  rewrite  as 

xW  =  ((D  -  -  A- 2X^~2)  +  Ubj-i)/^j-i.  (14) 

Computing  . . .  ,x^]  by  this  recurrence  can  be  accomplished  by  applying  PA1  to  the  following  re¬ 
currence  for  polynomials  pj(z,c),  where  the  sequence  of  indeterminates  c  :=  (co,  ■  ■ . ,  Cj_ i, . . .)  represents 
the  terms  (Ubo, . . . ,  Ubj- 1, . . .): 


Po(z,c)  :=  1,  pi  (z,  c)  :=  ((z  -  a0)p0(z,c)  +  c0)/ j0,  and  for  j  >  1, 

Pj(z,c)  :=  ((z  -  -  ffi-2Pj-2{z,  c)  +  cj-^/jj-!. 


(15) 


Given  the  notation  established,  we  construct  PA1-BC  (Alg.  3). 

We  redefine  Q  =  (V,  £)  in  terms  of  the  graph  of  D  (as  opposed  to  that  of  A,  which  may  be  complete). 
Processor  m  must  own  the  following  rows  of  D,  U,  V,  and  adA 


D 


{i: 


„(!) 


eTC(vm)}’  U{i-.xffie'R.tym)Y  V{i-.x(f)evmy  and7e(0)(Vm)  X{i:x<s> )en(Vm)} 


(i) 

in  order  to  compute  the  entries  x)  €  V, 


Algorithm  3  PA1-BC.  Code  for  processor  m. 

1:  Compute  local  rows  of  Kk-i{D,  U,  Xfc_2)  with  PAl,  premultiply  by  local  columns  of  VH . 
2:  Compute  [1U0, . . . ,  Wc_2]  by  an  Allreduce. 

3:  Compute  wj  for  0  <  i  <  k  —  2  and  1  <  j  <  k  —  i  —  1,  via  (12). 

4:  Compute  local  rows  of  Kk{D,x^°\Tk- 1)  with  PAl,  premultiply  by  local  columns  of  VH . 
5:  Compute  [yo>  •  •  • ,  '!Jk-i\  by  an  Allreduce. 

6:  Compute  [fr0, . . . ,  frfc-i]  by  (13). 

7:  Compute  local  rows  of  [A°\  . . . ,  x^]  with  PAl,  modified  for  (14). 


In  exact  arithmetic,  PA1-BC  returns  the  same  output  as  PAO.  However,  by  exploiting  the  splitting 
A  =  D  +  USVH ,  PA1-BC  may  overcome  the  problems  of  PAl,  and  allow  us  to  avoid  communication  when 
A  itself  is  not  well  partitioned. 

One  can  think  of  the  sequential  blocking  covers  algorithm  in  [7]  as  a  special  case  of  a  sequential 
execution  of  PA1-BC.  Given  a  set  of  blocker  vertices,  indexed  IC{1,...,  n},  the  algorithm  in  [7]  removes 
the  outgoing  edges  from  those  vertices,  i.e. ,  removes  rows  T  from  A,  to  obtain  a  well-partitioned  matrix 
D.  In  our  notation,  this  means  setting  U  :=  [e*  :  i  €  T]  and  SVH  :=  Ax,  where  e*  is  the  ith  identity 
column. 
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PAO 

Flops 

(8s2  +  85  +  1  )kq(n/p) 

Words 

4skq(n/p  +  s ) 

Msgs. 

8  k 

Mem. 

((2s  +  l)2  +  q(k  +  1  ))(n/p)  +  4  sq(n/p)L/’2  +  4  s2q 

PAl 

Flops 

(8s2  +  8  s  +  l)(kq(n/p)  +  2sk2q(n/p )1^2  +  (4/3)s2fc3q) 

Words 

4  skq(n/p  +  sk) 

Msgs. 

8 

Mem. 

((2s  +  l)2  +  q(k  +  1  ))(n/p)  +  ((2s  +  l)2  +  1)(4  skq(n/pY,'z  +  4  s2k’zq) 

Table  2:  Complexity  comparison  of  PAO  and  PA1  for  computing  [ Dx , . . . ,  Dkx ]  on  p  processors,  when  D 
is  a  (2s  +  l)2-point  stencil  on  an  n1/z-by-n1^1  mesh.  These  results  are  modified  from  [4,  Table  1]  to  allow 
for  x  to  have  q  >  1  columns.  When  computing  the  3-term  recurrence  (3)  resp.  (15),  the  flop  costs  for  both 
algorithms  increase  by  additive  factors  of  (5 k  —  7 )q(n/p)  resp.  (6 k  —  8 )q(n/p). 


PAO 

Flops 

Fpaoo#,  <l)  +  kqr(4(n/p)  +  p) 

Words 

Wpao  (k,q)  +  2kqrp 

Msgs. 

Spao  (k,q)  +  klg(p) 

Mem. 

Mpao  (k,q)  +  2r(n/p) 

PAl-BC 

(offline) 

Flops 

FpAi(3)(fc  -  2,r)  +  krz(2(n/p)  +p) 

Words 

WpAi  (. k  —  2,  r)  +  2  fcr2p 

Msgs. 

Spai  (k  -  2,r)  +  lg (p) 

Mem. 

Mpai  (k  -  2,  r)  +  r(n/p) 

PAl-BC 

(online) 

Flops 

FpAi(3)(fc  -  1,  q)  +  FPai(3c)(&,  q)  +  kqr(4(n/p )  +p+  0(sk(n/p)1/z))  +  kz(k  +  q)rz 

Words 

Wpai  (k  -  1,  q)  +  WpAi(fc,  q)  +  2kqrp 

Msgs. 

Spai  (k-  1  ,q)  +  Spai  (k,q)  +  lg(p) 

Mem. 

MpAi(fc  -  l,g)  +  Mpai (k,q)  +  r(n/p)  +  0(skr(n/p)1/z ) 

Table  3:  Complexity  comparison  for  ‘stencil  plus  rank-r’  example,  showing  leading  order  constant  factors. 
‘Offline’  refers  to  Lines  1-3  and  ‘Online’  refers  to  Lines  4-7  of  PA1-BC.  The  functions  F,  W,  S,  M  denote 
flops,  words,  messages,  and  memory  according  to  Table  2;  the  ‘(3)’  resp.  ‘(3c)’  suffixes  in  the  subscripts 
of  F  indicate  the  3-term  recurrences  (3)  resp.  (15)  (see  caption  of  Table  2). 

3.3  Complexity  Analysis  for  a  Model  Problem 

To  demonstrate  the  potential  performance  benefits  of  PA1-BC  over  PAl  and  PAO,  we  consider  the  following 
example.  Let  A  =  D  +  UVH  be  a  dense  n  x  n  matrix,  where  the  graph  of  D  is  a  (2s  +  l)2-point  stencil 
on  a  ?r1/2-by-n1/2  mesh,  and  U  and  V  are  dense  n-by-r  matrices.  We  assume  p  and  n  are  perfect  squares 
and  p1/2  divides  n1/2,  and  we  partition  the  vertices  (1, . . . ,  n}  so  that  each  processor  owns  a  (n/p)1/2- by- 
(n/p)1/2  square  of  the  domain.  Following  [4],  we  assume  k  <  (n/p)1/2,  so  that  PAl  (applied  to  D)  need 
only  communicate  among  neighboring  processors.  We  give  the  complexity  for  PAO  and  PAl  in  Table  2, 
modified  from  [4,  Table  1]  to  allow  x  to  have  q  >  1  columns,  and  to  use  the  three-term  recurrences  (3) 
and  (15). 

We  assume  our  collective  communications  attain  the  lower  bounds  in  [2,  Table  1].  We  consider  Lines  1- 
3  of  PAl-BC  as  ‘offline’  operations,  and  Lines  4-7  as  ‘online’  computations,  since  typically  the  matrix  A 
and  the  polynomials  pj  are  fixed  across  many  calls  to  PAl-BC  (or  PAO/1),  while  the  input  matrix  x 
changes  on  each  invocation.  We  compare  PAO  and  PAl-BC  in  Table  3. 

We  use  the  following  arithmetic  counts:  apply (S,  ni,  n?)  =  0  is  the  cost  of  applying  S  to  an  ni-by-ri2 
matrix  (in  this  example  S  is  the  identity),  add(ni,ri2)  =  scal(ni,7i2)  =  niri2  is  the  cost  of  adding  two 
ri\ -by-ri'2  matrices  or  multiplying  one  by  a  scalar,  and  mm(ni,n2,ri3)  =  2n\n2n3  —  niu.3  is  the  cost  of 
multiplying  matrices  of  size  ni-by-n2  and  ?Z2-by-n3. 

PAO  We  modify  PAO  slightly  to  exploit  the  splitting  Ax  =  Dx  +  U(VHx)  to  save  computation  and 
communication.  We  assign  each  processor  m  a  block  row  (indices  {i  :  G  Vm})  of  D,  U,  V,  and  x^  — 

a  nonoverlapping  partition  of  each,  as  opposed  to  PAl,  whose  partitions  may  overlap  when  k  >  1.  We  first 


Figure  1:  An  example  of  a  hierarchical  semiseparable  matrix,  with  £  =  2. 


compute  y  =  VH x  via  a  local  matrix  multiplication  (costing  mm(r,  n/p ,  q)  flops)  followed  by  an  Allreduce 
of  y  ((p  —  1  )rq  flops,  2 (p  —  1  )rq  words,  and  lg(p)  messages).  Then  we  compute  Uy  by  a  local  matrix 
multiplication  (costing  mm (n/p,r,q)  flops).  Then  we  communicate  rows  of  x^’  in  order  to  compute  the 
local  rows  of  Dx^;  the  costs  of  this  step  follow  from  Table  2. 

PA1-BC  PAl-BC  makes  3  calls  to  PA1,  in  Lines  1,  4,  and  7.  These  costs  are  shown  in  Table  2  (note 
that  the  third  PA1  call  uses  the  PAl(3c)  variant). 

Now  we  analyze  the  applications  of  VH  (Lines  1  and  4)  and  subsequent  Allreduce  collectives  (Lines  2  and  5). 
Applying  the  local  columns  of  VH  costs  mm (r,  n/p,  (k  —  1  )r)  (resp.  mm(r,n/p,kq))  flops,  and  the  sub¬ 
sequent  Allreduce  costs  {p  —  l)(fc  —  l)r2  flops  and  2 (p  —  1  )(k  —  1  )r2  words  (resp.  (p  —  1  )krq  flops  and 
2 (p  —  1  )krq  words),  and  lg(p)  messages  (both). 

Computing  wj  in  Line  3  costs 


k— 2 k—i—  1 

(3k  -  2)  +  E  E  8i  +  5  =  4/3/c3  +  0(1) 

*= 2  3= 1 

flops  (performed  locally),  and  computing  bg  through  bk-i  in  Line  6  costs 

fc— l  j 

E  (aPP1y('5>  r’  Q)+3-  add(r,  q)  +  ^  (  mm(r,  r,  q)  +  scal(r,  q)  +  i  ■  scal(r,  r)  +  (i  —  l)add(r,  r)j  ''j 

j—O  i— 1 

<  k2(k  +  q)r2 

local  flops.  Note  that  if  Tk  is  Toeplitz  (e.g.,  {pj}  are  monomials),  then  we  do  not  need  to  compute  wj, 
as  each  is  an  identity  column;  furthermore,  the  upper  bound  k2(k  +  q)r2  above  reduces  to  k2(0(  1)  +  q)r2 
flops. 

Lastly,  computing  U[bo, . . .  ,bk- i]  in  Line  7  requires  a  local  computation  of 

mm  (n/p  +  Ask^/p)1^2  +  4s2fc2,  r,  kq)  <  2  kqr(n/p)  +  8  sfc2gr(n/p)1^2  +  0(s2k3qr) 


flops. 


4  Extension  to  Hierarchical  Matrices 

Hierarchical  (TL-)  matrices  (mentioned  above)  are  amenable  to  the  splitting  D  +  USVH .  Particularly 
interesting  are  a  special  class  of  77-matrices  called  hierarchical  semiseparable  (HSS)  matrices.  We  briefly 
review  the  HSS  representation,  using  the  notation  in  [3]  (see  also  Fig.  1).  For  any  L  £  {0, . . . ,  [lg«J},  we 
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can  write  A  hierarchically  by  recursively  defining  its  diagonal  blocks  as  A  =:  Dq-i  and  for  1  <  £  <  L  and 
1  <  i  <  2e~1, 

y-v  _ _  Dg.\2i  —  1  U t\2i  —  l,2iV^2i 

t~1'i  [  Ue.,2iBe.2it2i_ De.2i  J  ’ 

whose  off-diagonal  blocks  are  defined  recursively  by 


Ue-i-i  —■ 


ut  ;2  i  —  i  Re  ;  2  i  —  1 

Ui-2iRe  ;2i 


and  Vt-i-i  —■ 


vt  ;2i  —  iW<  ;  2  i  —  1 

Vg;2tW]g;2* 


for  2  <  £  <  L  and  1  <  «  <  2^_1,  and  of  course  are  empty  at  level  0:  i/o;i>Vo;i  :=  [].  The  subscript  £ 
indicates  the  level  in  a  perfect  binary  tree  of  depth  L,  and  i  indexes  the  2r  vertices  at  depth  £. 

The  action  of  A  on  a  matrix  x,  i.e. ,  v  :=  Ax,  satisfies  t>0;i  =  I^o-,ixO;ii  and  for  1  <  £  <  L  and 
1  <i  <2e,  satisfies  ve;i  =  D^xi-i  +  with  /i;i  =  -Bi;i;2flri;2,  /i;2  =  and  for  1  <  £  <  L  -  1 

and  l<i<2e, 


pT 

-ft£+l;2i-l 

T 

ft;i 

rt 

JD£  +  l-,2i-l,2i 

9t+l-,2i 

/t+l;2i-l  — 


where  for  1  <  £  <  L  —  1  and  1  <  i  <  2e,  g ^  = 


and  fe+i-2i  — 


jdT 

IX£+l;2i 

T 

h;i 

rt 

JD£  +  l-,2i,2i-l 

9£+l\2i-l 

W£+i-2i-l 
W^+i;2  i 


H 


atM+lW  ]  >  and  9L;i  =  V^XL-i  for  1  <  *  <  2l. 


For  any  level  £  we  assemble  the  block  diagonal  matrices 


Ue  :=  ^ 


y 


0i=i Vt* 


:=  ®j=1-D^;i> 


(16) 


denoted  here  as  direct  sums  of  their  diagonal  blocks.  We  also  define  matrices  Si,  representing  the  recur¬ 
rences  for  fi-i  and  gi-i,  satisfying 

v  =  Ax=:  Dex  +  UiSiViHx.  (17) 

We  will  now  discuss  parallelizing  v  =  Ax,  to  generalize  PAO  and  PA1  to  HSS  matrices.  We  make  the 
following  assumptions  for  the  subsequent  sections.  The  matrix  A  is  n-by-n,  and  its  HSS  representation 
has  a  perfect  binary  tree  structure  to  some  level  L  >  2.  We  have  p  >  4  processors,  and  for  simplicity 
assume  that  p  is  a  power  of  2.  For  each  processor  m  €  {0, 1, . . .  ,p  —  1},  let  Lm  denote  the  smallest  level 
£  >  1  such  that  p/2t  divides  m.  We  also  define  the  intermediate  level  1  <  Lp  :=  lg(p)  <  L  of  the  HSS 
tree;  each  Lm  >  Lp,  where  equality  is  attained  when  m  is  odd. 


4.1  PAO  for  HSS  Matrices 

Given  the  notation  and  assumptions  in  the  preceding  section,  we  show  how  to  modify  PAO  when  A  is 
HSS,  exploiting  the  v  =  Ax  recurrences  for  each  1  <  j  <  k  —  the  result  is  called  PAO-HSS.  For  clarity, 
we  write  the  parallel  upsweep/downsweep  subroutine  separately,  in  Alg.  5. 

Our  PAO-HSS  algorithm  is  based  on  the  serial  algorithm  for  HSS  matrix- vector  multiplication  from  [3], 
summarized  in  the  recurrences  in  the  preceding  section.  We  are  unaware  of  previous  work  demonstrating 
a  parallelization  of  these  recurrences,  although  it  is  obvious,  given  the  perfect  binary  tree  structure.  First, 
on  the  upsweep,  each  processor  locally  computes  x  (its  subtree,  rooted  at  level  Lp  =  lg(p))  and  then 
performs  Lp  steps  of  parallel  reduction,  until  there  are  two  processors  active,  and  then  a  downsweep  until 
level  Lp,  at  which  point  each  processor  is  active,  owns  SlpV^x,  and  recurses  into  its  local  subtree  to 
finally  compute  its  rows  of  v  =  Dlx  +  UlSlVlX.  More  precisely,  we  assign  processor  m  the  computations 

fa  and  for  {£,i  '■  2em/p+L<i<2e\m+l)/p}  &Ild  fof  i  :  ^=2^+?} 

(so  only  processors  0  and  p/2  are  active  at  the  top  level  [£  =  1)),  and  the  matrices  Dl,  Ul,  and  Vl  are 
distributed  contiguously  block  rowwise,  so  processor  m  stores  blocks  Dl  .m+i,  f7z,p.m+i,  and  VLp-m+ 1- 
The  Ri-i,  Wi-i,  and  B i-i  matrices  are  distributed  so  that  they  are  available  for  the  computations  in 
the  upsweep/downsweep  (Alg.  5);  we  will  omit  further  details  for  brevity,  but  will  analyze  the  memory 
requirements  when  we  compare  with  PA1-HSS,  below. 
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Algorithm  4  PAO-HSS.  Code  for  processor  to. 

1:  for  j  =  1, . . . ,  k  do 

2:  Perform  upsweep  and  downsweep  (Alg.  5)  to  compute  local  rows  of  Ax^~x\ 
3:  Compute  x^  G  Vm  locally. 

4:  end  for 


Algorithm  5  Upsweep  and  downsweep  for  PAO-HSS.  Code  for  processor  to. 


1 

UPSWEEP: 

24 

i  =  2  ^m/p  +  1. 

2 

for  t  =  L,  L  —  1, . . . ,  Lp  do 

25 

Compute  fe+i-,2i-i- 

3 

for  i  =  2^m/p  +  1, . . . ,  2 e(m  +  1  )/p  do 

26 

if  l  <  Lp  —  1  then 

4 

Compute  ge-i 

27 

Send  fe+i-2i-i  to  proc.  m  +  p/2e+2. 

5 

end  for 

28 

end  if 

6 

end  for 

29 

else  if  p/ 2f-+1  divides  to  then 

7 

for  £  =  Lp  —  1, . . . ,  1,  0  do 

30 

i  =  2e"m/p  +  1/2. 

8 

if  p/ 2e  divides  to  then 

31 

if  t  >  0  then 

9 

i  =  2  ^m/p  +  1. 

32 

Receive  from  proc.  to  —  p/ 2e+1. 

10 

Send  gt+i-2i—i  to  proc.  m+p/2^+1. 

33 

end  if 

11 

Receive  ge+i-2i  from  proc.  m  +p/2e+1. 

34 

Compute  fe+i-2i- 

12 

if  t  >  0  then 

35 

if  l  <  Lp  —  1  then 

13 

Compute  g^-i. 

36 

Send  fe+i-2i  to  proc.  to  +  p/2e+2. 

14 

end  if 

37 

end  if 

15 

else  if  p/2i+1  divides  to  then 

38 

end  if 

16 

i  =  2^m/p  +  1/2. 

39 

end  for 

17 

Send  ge+i-.2i  to  proc.  m  —  p/ 2e+1. 

40 

for  £  =  Lp, ...  ,L  —  1  do 

18 

Receive  ge+i-2i-i  from  proc.  m—p/ 2e+1. 

41 

for  i  =  2 em/p  +  1, . . . ,  2 e(m  +  1  )/p  do 

19 

end  if 

42 

Compute  fe+i-2i-i  and  ft+ 1;2i. 

20 

end  for 

43 

end  for 

21 

DOWNSWEEP: 

44 

end  for 

22 

for  t  =  0, . . . ,  Lp  —  1  do 

45 

Compute  VL-i  for  i=mn/ (pr)+ 1, . . . ,  (to+1) 

23 

if  p/2e  divides  to  then 

11 


4.2  PA1  for  HSS  Matrices 


The  block-diagonal  structure  of  Dg,  Ug,  and  Vf  in  (16)  suggests  an  efficient  parallel  implementation  of  PA1- 
BC,  which  we  present  as  PA1-HSS  (Alg.  6).  As  opposed  to  PAO-HSS,  the  only  parallel  communication  in 
PA1-HSS  occurs  in  two  Allgather  operations,  in  Lines  1  and  5.  The  computation  cost  increases,  however, 
since  each  processor  performs  the  entire  upsweep/downsweep  between  levels  1  and  Lp  locally.  Recall  in 
PAO-HSS,  there  was  parallelism,  albeit  less  than  full,  during  these  levels  (Lines  7-39  of  Alg.  5).  Our 
further  loss  of  parallelism  shows  up  in  our  complexity  analysis  (see  Table  4)  as  a  factor  of  p,  compared  to 
a  factor  of  lg(p)  in  PAO-HSS;  we  also  illustrate  this  tradeoff  in  our  performance  modeling  (see  §5). 

We  assume  the  same  data  layout  as  PAO-HSS:  each  processor  stores  a  diagonal  block  of  Dlv,  Ulp,  and 
Vlp  (but  only  stores  the  smaller  blocks  of  level  L).  We  assume  each  processor  is  able  to  apply  Slp-  We 
rewrite  (14)  for  the  local  rows,  and  exploit  the  block  diagonal  structure  of  D^p  and  Ulp,  to  write 


-O') 

bLp,m+ 1 


=  ,m+ 1  aj-l)xLp,m+l  fij-2XLp,m+l  +  ULp,m+l(bj-l){mr+l,...,(m+l)r}  )  /7j-l-  (18) 


„(i- 2) 


We  will  not  exploit  the  fact  that  each  processor  ultimately  needs  only  a  subset  of  the  rows  of  bj  computed 
in  Line  6.  Each  processor  locally  computes  all  rows  of  bj  =  SjJp  V/r  A7')  =  Slp  •  z,  where  2  is  the  maximal 
parenthesized  term  in  (13),  using  the  HSS  recurrences: 


V?xW  =  z  =:  [g[ 


T 

Lp,l 


T  1 T  r  fT 
9lp,p\  ^  [fLp,  1 


SlpV[ 


The  rest  of  PAl-HSS  is  similar  to  PA1-BC,  except  that  the  Allreduce  operations  have  now  been  replaced 
by  Allgather  operations,  to  exploit  the  block  structures  of  Wt  and  yt. 


Algorithm  6  PAl-HSS  (Blocking  Covers).  Code  for  processor  m. 

1:  Compute  Kk^(DLp.m+1,ULp.m+1,Tk_2),  premultiply  by  V/^.m+1. 

2:  Compute  [Wo, . . . ,  Wc_2]  by  an  Allgather. 

3:  Compute  wj  for  0  <  i  <  k  —  2,  and  1  <  j  <  k  —  i—  1,  via  (12). 

4:  Compute  A'fe(T>LJ);ro+i,a;^);TO+1,Tfc_i),  premultiply  by  V%.m+1. 

5:  Compute  [yo >  •  •  • ,  Uk-i\  by  an  Allgather. 

6:  Compute  [bo,  •  ■  • ,  bk-i]  by  (13),  where  S  =  Slp  is  applied  as  described  above. 
7:  Compute  local  rows  of  . . . ,  x^]  according  to  (18). 


4.3  Complexity  Analysis  for  a  Model  Problem 

We  compare  the  asymptotic  complexity  of  PAO-HSS  and  PAl-HSS.  We  assume  A  is  n-by-n  and  dense, 
and  represented  by  the  dense  r-by-r  matrices  Rty  and  Wty  (for  levels  2  through  L),  (for  levels  1 

through  L)  and  Dm,  Up-i,  and  Vp:i  (i.e. ,  at  the  leaf  level).  To  simplify  the  presentation,  we  assume  n 
and  r  are  powers  of  2  and  L  =  lg (n/r).  A  matrix  satisfying  these  assumptions  is  said  to  have  HSS  rank 
r;  here  we  assume  A  is  already  represented  as  such  a  set  of  r-by-r  matrices.  We  summarize  the  results  in 
Table  4. 

PAO-HSS  Computing  the  local  portion  of  the  upsweep  and  downsweep  (levels  L  through  Lp,  i.e., 
Lines  2-6,40-45  of  Alg.  5)  costs 


L— 1 

(2L/p)mm(r,  r,  q)  +  ^  2(2</j))mm(r,  2 r,  q)  +  (2L/p)mm(r,  2 r,  q) 

l=Lp 

flops  and  no  communication.  For  the  intermediate  lines,  we  follow  processor  0,  who  is  active  on  every 
level.  Processor  0  computes  one  (jeAi  for  (.  =  Lp,Lp  —  1, . . . ,  1,  performing  Lp  +  1  sends  and  receives  of 
size  rq),  and  then  computes  one  for  £  =  1 ,Lp,  sending  messages  (of  size  rq)  Lp  —  1  times.  Thus, 
2Lpmm(r, 2r, q)  flops.  The  remaining  cost  is  computing  the  three-term  recurrence  (locally),  which  is  an 
additional  (5 k  —  7)q(n/p)  flops. 
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PAO-HSS  requires  enough  memory  to  store  the  local  blocks  of  Dl,  Ul,  and  Vl,  a  total  of  3 rn/p 
words.  Furthermore,  each  processor  must  store  the  Re-i,  We-,%  and  i  matrices  for  its  local  up¬ 
sweep /downs  weep,  a  total  of  (2'e/yj)2r2  words,  and  some  subset  of  these  matrices  for  the  parallel 

portion  of  the  upsweep/downsweep.  Processor  0  must  be  able  to  store  2 Lp  ■  2 r2  more,  an  upper  bound  for 
the  other  processors.  Lastly,  each  processor  must  be  able  to  store  their  (k  +  l)qn/p  entries  of  a/°\  . . . ,  . 


PA1-HSS  As  opposed  to  PA1-BC,  Lines  1,  4,  and  7  of  PA1-HSS  will  not  require  communication,  due 
to  the  block  diagonal  structure  of  D1  U,  and  V.  We  assume  all  three  lines  are  implemented  to  exploit  the 
upsweep/downsweep  recurrences  for  (locally)  applying  the  (HSS  rank  r)  matrix  I?Lp;m+i  to  an  n/p-by-r 
matrix  (Line  1)  or  to  an  n/p-by-q  matrix  (Lines  4  and  7).  (Note  that  in  Line  1  we  could  further  expand 
ULp-m+i  into  a  block  diagonal  matrix  (with  r-by-r  blocks),  using  the  matrices;  however,  this  seems 
to  only  increase  the  arithmetic  cost.)  The  arithmetic  cost  for  (locally)  multiplying  Dl  -m+ 1  (with  HSS 
rank  r)  by  a  dense  n/p-by-q  matrix  is: 

>_LTP+,l+  (2L/P)mm(r,  2  r,  q) 

—  -Lp  ~r  1 

<  18  qr(n/p)  +  20  qr2 

flops.  We  apply  Dl  ;m+ 1  k  —  2,  k—  1,  and  k  times,  in  lines  Line  1,  Line  4,  and  Line  7,  resp.  Evaluating  the 
three-term  recurrences  (in  the  same  three  lines)  increases  the  cost  by  (5k—17)r(n/p),  (5k— 12)  q(n/p),  and 
(6k  —  8 )q(n/p)  flops,  resp.  Then  applying  V^.m+1  in  Lines  1  and  4  costs  mm(r,  n/p,  r)  and  mm(r,  n/p,  q) 
flops,  resp.  (note  that  it  would  nearly  quadruple  these  costs  to  apply  m+1  using  the  HSS  upsweep 
recurrences,  rather  than  as  a  full  matrix).  The  Allreduce  operations  in  PAl-BC  have  become  Allgather 
operations  in  Lines  2  and  5,  due  to  the  structure  and  parallel  layout  of  the  matrices  W,  =  VHpi(D)U 
and  pi  =  VH pi(D)x.  Since  we  parallelize  at  HSS  level  Lp  (rather  than  the  leaf  level  L),  each  processor 
distributes  k  —  1  r-by-r  (resp.  k  r-by-q )  blocks.  This  costs 

0  flops,  (k  —  l)r2(p  —  1)  resp.  kqr(p  —  1))  words,  lg(p)  msgs 

Line  3  (performed  locally)  has  the  same  0(k 3)  cost  as  in  PAl-BC,  a  lower  order  term.  The  analysis  of 
Line  6  (also  performed  locally)  is  similar  to  that  of  PAl-BC,  except  now  r  is  replaced  by  pr ,  except  for 
the  terms  involving  IT; ,  which  is  block  diagonal  with  p  r-by-r  diagonal  blocks.  In  total,  each  processor 
performs 


(2</j>)mm(r,  r,  q) 
(2</p)mm(r,  2r,  q) 


(2</p)mm(r,  2 r,  q) 
(2</p)mm(r,  r,  q) 


Y  (aPPMS.-PA  Q) +3-  add  (pr,  q)  +  £(  mm(pr,  r,  q)  +  scal(r,  q)+i-  scal(pr,  r)  +  (i  —  l)add(pr,  r) 

j—0  i— 1 

flops,  where  each  application  of  S  to  a  pr-by-q  matrix  costs 


Lp 

2^mm(r,  2r,  q) 
e=i 


J  2<mm(r,  2 r,  q)  £  >  1 

JTi  { (2e/p)mm(r,  r,q)  £=1 


1 6pqr2  +  0(qr) 


flops. 

The  memory  requirements  for  PAl-HSS  are  the  same  as  PAO-HSS,  except  that  each  processor  needs 
all  the  R,(:i,  We-i,  and  B^i  matrices  for  levels  1  through  Lp,  a  cost  of  8 pr2  words. 


5  Performance  Modelling 

We  model  speedups  of  our  new  algorithms  for  the  two  example  problems  discussed  in  the  text:  a  2D  stencil 
plus  rank-r  component,  and  an  HSS  matrix.  Complexity  counts  used  can  be  found  in  Tables  3  and  4, 
resp.  We  use  two  machine  models  used  in  [8]  -  ‘Peta,’  an  8100  processor  petascale  machine,  and  ‘Grid,’ 
125  terascale  machines  connected  via  the  Internet.  The  Peta  machine  has  a  flop  rate  of  7  =  2  •  10-11 
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PAO-HSS 

Flops 

kq(14r  +  0(l))n/p  +  8kqr2  lg (p) 

Words 

kq(3r  +  0(l))\g(p) 

Msgs. 

3  k  lg(» 

Mem. 

(7r  +  (k  +  1  )q)n/p  +  4r2  lg(p) 

PA1-HSS 

(offline) 

Flops 

18 kr2n/p 

Words 

krzp 

Msgs. 

lg  ip) 

Mem. 

(7  r  +  (k  +  1  )q)n/p  +  16  rzp) 

PA1-HSS 

(online) 

Flops 

kq(36r  +  0(\))n/p  +  (fc2/3  )(k  +  10  q)rzp 

Words 

kqrp 

Msgs. 

lg  (P) 

Mem. 

(Jr  +  (k  +  1  )q)n/p  +  16  rzp 

Table  4:  Complexity  comparison  for  parallel  HSS  algorithms  PAO-HSS  and  PA1-HSS,  showing  leading 
order  constant  factors.  ‘Offline’  refers  to  Lines  1  -3  and  ‘Online’  refers  to  Lines  4-  7  of  PA1-HSS. 
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Figure  2:  Predicted  speedups  for  model  problem:  2D  9-point  stencil  +  rank-1  dense  component.  Param¬ 
eters  obtained  from  performance  modeling  in  [8].  Left:  Speedups  for  PAl  over  PAO  for  various  k  values 
on  Peta.  Right:  Speedups  for  PAl  over  PAO  for  various  k  values  on  Grid. 


s/flop,  latency  cost  a  =  10-5  s/message,  and  bandwidth  cost  /?  =  2  •  10-9  s/word.  The  Grid  machine  has 
flop  rate  7  =  10-12  s/flop,  latency  cost  a  =  1CT1  s/message,  and  bandwidth  cost  /3  =  25  •  10-9  s/word. 

For  the  stencil  plus  rank-r  test,  we  ran  the  performance  model  for  q  =  r  =  s  =  1,  i.e.,  a  2D  9-point 
stencil  plus  rank-1  matrix  (times  a  vector).  We  picked  n  and  k  based  on  those  chosen  in  [8];  note  that 
the  dimension  of  A  is  n2.  Fig.  2  shows  the  best  speedup  of  PA1-BC  over  PAO  over  all  p  values  such  that 
p  <  min(pmax,  ( n/k )2).  The  p  values  which  resulted  in  the  maximum  speedup  for  this  test  problem  are 
shown  in  Fig.  3.  On  Peta,  the  largest  speedup  was  7.8x.  Speedups  were  generally  higher  on  Grid,  with  a 
maximum  speedup  of  25.9  x  for  this  range  of  k.  We  expect  higher  speedups  on  Grid  since  PAO  is  extremely 
latency  bound  on  this  machine.  For  both  models,  predicted  speedups  decrease  with  increasing  n  and  k 
due  to  growing  additional  flop  and  bandwidth  terms. 

Speedups  of  PA1-HSS  over  k  invocations  of  PAO-HSS,  for  both  Peta  and  Grid,  are  shown  in  Fig.  4.  We 
use  the  parameter  triplets  (m,pi,ri)  where  p  =  (4,16,64,256,1024,4096),  n=  (2.5,5,10,20,40,80)  •  103, 
and  r  =  (5,  5,  5,  5, 6,  7),  based  on  the  parameters  for  parallel  HSS  performance  tests  in  [10].  Note  that  for 
Grid  we  only  use  the  first  3  parameter  triplets  since  pm ax  =  125.  On  Grid,  PAO-HSS  is  extremely  latency 
bound,  so  our  3 k  reduction  in  latency  results  in  a  3 fcx  faster  algorithm  (the  extra  flop  and  bandwidth 
costs  are  insignificant  for  these  values  of  k).  This  is  the  best  we  can  expect.  As  shown  in  Fig.  5,  for  very 
large  k ,  the  extra  terms  begin  to  dominate,  and  speedups  eventually  begin  to  decrease  with  k.  On  Peta, 
we  see  0(k)  speedups  for  smaller  p  and  /c,  but  as  these  quantities  increase,  the  expected  speedup  drops. 
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Figure  3:  Best  p  values,  used  to  obtain  speedups  in  Fig.  4.  Left :  Peta  machine  model,  where  pm ax  =  8100. 
Right :  Grid  machine  model,  where  pmax  =  125. 


This  is  due  to  the  extra  factor  of  p/\g{p)  in  the  bandwidth  cost  and  the  factor  of  k2p/\g(p)  in  the  flop 
cost  of  PA1-HSS.  Since  the  relative  latency  cost  is  lower  on  Peta,  the  effect  of  the  extra  terms  becomes 
apparent  for  larger  k  and  p. 


6  Future  Work  and  Conclusions 

In  this  work,  we  derive  a  new  parallel  communication-avoiding  algorithm  for  the  matrix  powers  compu¬ 
tation  with  A  =  D  +  USVH ,  where  D  is  well  partitioned  and  USVH  has  low  rank  but  may  not  be  well 
partitioned.  This  is  a  significant  improvement  over  the  algorithms  in  [4,  8],  which  require  A  to  be  well 
partitioned.  Our  approach  allows  us  to  exploit  the  low-rank  structure  to  asymptotically  reduce  the  paral¬ 
lel  latency  cost:  on  latency-bound  problems,  our  model  indicates  an  0(k)  speedup.  We  demonstrate  the 
generality  of  our  parallel  blocking  covers  technique  by  applying  it  to  matrices  with  hierarchical  structure. 
Detailed  performance  modeling  suggests  24x  speedups  on  petascale  machines,  and  up  to  3 k  speedups 
on  extremely  latency-bound  machines,  despite  growth  in  arithmetic  and  bandwidth  costs.  Future  work 
includes  a  high-performance  parallel  implementation  of  our  matrix  powers  kernel  variants  to  verify  the 
speedups  predicted  by  performance  modeling.  We  also  plan  to  explore  the  application  of  blocking  covers 
to  other  parallel  algorithms. 
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for  various  k  values  on  Grid.  We  use  parameters  p  =  {4, 16, 64},  n  =  {2.5, 5, 10}  x  103,  r  =  {5,  5,  5}. 


Figure  5:  Speedup  versus  k  for  p  =  {4, 16,64},  for  HSS  problem  on  Grid  machine  model. 
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